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Abstract 

This  document  is  intended  for  computer  scientists  who  would  like  to  try  out  a  Markov  Chain  Monte 
Carlo  (MCMC)  technique,  particularly  in  order  to  do  inference  with  Bayesian  models  on  problems  related 
to  text  processing.  We  try  to  keep  theory  to  the  absolute  minimum  needed,  though  we  work  through  the 
details  much  more  explicitly  than  you  usually  see  even  in  “introductory”  explanations.  That  means  we’ve 
attempted  to  be  ridiculously  explicit  in  our  exposition  and  notation. 

After  providing  the  reasons  and  reasoning  behind  Gibbs  sampling  (and  at  least  nodding  our  heads  in  the 
direction  of  theory) ,  we  work  through  an  example  application  in  detail — the  derivation  of  a  Gibbs  sampler  for 
a  Naive  Bayes  model.  Along  with  the  example,  we  discuss  some  practical  implementation  issues,  including 
the  integrating  out  of  continuous  parameters  when  possible.  We  conclude  with  some  pointers  to  literature 
that  we’ve  found  to  be  somewhat  more  friendly  to  uninitiated  readers. 
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Abstract 

This  document  is  intended  for  computer  scientists  who  would  like  to  try  out  a  Markov  Chain  Monte  Carlo 
(MCMC)  technique,  particularly  in  order  to  do  inference  with  Bayesian  models  on  problems  related  to  text 
processing.  We  try  to  keep  theory  to  the  absolute  minimum  needed,  though  we  work  through  the  details 
much  more  explicitly  than  you  usually  see  even  in  “introductory”  explanations.  That  means  we’ve  attempted 
to  be  ridiculously  explicit  in  our  exposition  and  notation. 

After  providing  the  reasons  and  reasoning  behind  Gibbs  sampling  (and  at  least  nodding  our  heads  in  the 
direction  of  theory) ,  we  work  through  an  example  application  in  detail — the  derivation  of  a  Gibbs  sampler  for 
a  Naive  Bayes  model.  Along  with  the  example,  we  discuss  some  practical  implementation  issues,  including 
the  integrating  out  of  continuous  parameters  when  possible.  We  conclude  with  some  pointers  to  literature 
that  we’ve  found  to  be  somewhat  more  friendly  to  uninitiated  readers. 

1  Introduction 

Markov  Chain  Monte  Carlo  (MCMC)  techniques  like  Gibbs  sampling  provide  a  principled  way  to  approximate 
the  value  of  an  integral. 

1.1  Why  integrals? 

Ok,  stop  right  there.  Many  computer  scientists,  including  a  lot  of  us  who  focus  in  natural  language  processing, 
don’t  spend  a  lot  of  time  with  integrals.  We  spend  most  of  our  time  and  energy  in  a  world  of  discrete  events. 
(The  word  bank  can  mean  (1)  a  financial  institution,  (2)  the  side  of  a  river,  or  (3)  tilting  an  airplane.  Which 
meaning  was  intended,  based  on  the  words  that  appear  nearby?)  Take  a  look  at  Manning  and  Schuetze 
[14],  and  you’ll  see  that  the  probabilistic  models  we  use  tend  to  involve  sums,  not  integrals  (the  Baum- Welch 
algorithm  for  HMMs,  for  example) .  So  we  have  to  start  by  asking:  why  and  when  do  we  care  about  integrals? 

One  good  answer  has  to  do  with  probability  estimation.1  Numerous  computational  methods  involve 
estimating  the  probabilities  of  alternative  discrete  choices,  often  in  order  to  pick  the  single  most  probable 
choice.  As  one  example,  the  language  model  in  an  automatic  speech  recognition  (ASR)  system  estimates 
the  probability  of  the  next  word  given  the  previous  context.  As  another  example,  many  spam  blockers  use 
features  of  the  e-mail  message  (like  the  word  Viagra,  or  the  phrase  send  this  message  to  all  your  friends)  to 
predict  the  probability  that  the  message  is  spam. 

Sometimes  we  estimate  probabilities  by  using  maximimum  likelihood  estimation  (MLE).  To  use  a  standard 
example,  if  we  are  told  a  coin  may  be  unfair,  and  we  flip  it  10  times  and  see  HHHHTTTTTT  (H=heads, 
T=tails),  it’s  conventional  to  estimate  the  probability  of  heads  for  the  next  flip  as  0.4.  In  practical  terms, 
MLE  amounts  to  counting  and  then  normalizing  so  that  the  probabilities  sum  to  1. 

1This  subsection  is  built  around  the  very  nice  explication  of  Bayesian  probability  estimation  by  Heinrich  [7]. 
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Figure  1:  Probability  of  generating  the  coin-flip  sequence  HHHHTTTTTT,  using  different  values  for 
P(heads)  on  the  x-axis.  The  value  that  maximizes  the  probability  of  the  observed  sequence,  0.4,  is  the 
maximum  likelihood  estimate  (MLE). 


count  (H)  4 

count  (H)  +  count  (T)  10 


(1) 


Formally,  MLE  produces  the  choice  most  likely  to  have  generated  the  observed  data. 

In  this  case,  the  most  natural  model  p  has  just  a  single  parameter,  7 r,  namely  the  probability  of  heads 
(see  Figure  l).2  Letting  X  =  HHHHTTTTTT  represent  the  observed  data,  and  y  the  outcome  of  the  next 
coin  flip,  we  estimate 


KmLE 

=  argmaxP(A’|7r) 

7 T 

(2) 

P{y\x) 

~  P{v\kmle) 

(3) 

On  the  other  hand,  sometimes  we  estimate  probabilities  using  maximum  a  posteriori  (MAP)  estimation. 
A  MAP  estimate  is  the  choice  that  is  most  likely  given  the  observed  data.  In  this  case, 


™MAP 

=  argmax  P(7ij  X) 

P(X\n)P(  tt) 

-  argrX  P(X) 

=  argmax  P(A’|7t)P(7t) 

(4) 

p(y\x) 

~  P(y \kmap) 

(5) 

In  contrast  to  MLE,  MAP  estimation  applies  Bayes’s  Rule,  so  that  our  estimate  (4)  can  take  into  account 
prior  knowledge  about  what  we  expect  7r  to  be  in  the  form  of  a  prior  probability  distribution  P(-7r).3  So, 

2  Specifically,  p  models  each  choice  as  a  Bernoulli  trial,  and  the  probability  of  generating  exactly  this  heads-tails  sequence  for 
a  given  7r  is  7r4(l  —  -nj®.  If  you  type  Plot  [p~4(l-p)  “6 ,  {p,0, 1}]  into  Wolfram  Alpha,  you  get  Figure  1,  and  you  can  immediately 
see  that  the  curve  tops  out,  i.e.  the  probability  of  generating  the  sequence  is  highest,  exactly  when  p  =  0.4.  Confirm  this  by 
entering  derivative  of  p~4(l-p)~6  and  you’ll  get  |  =  0.4  as  the  maximum.  Thanks  to  Kevin  Knight  for  pointing  out  how 
easy  all  this  is  using  Wolfram  Alpha.  Also  see  discussion  in  Heinrich  [7],  Section  2.1. 

3We  got  to  (4)  from  the  desired  posterior  probability  by  applying  Bayes’s  Rule  and  then  ignoring  the  denominator  since  the 

argmax  doesn’t  depend  on  it. 
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for  example,  we  might  believe  that  the  coin  flipper  is  a  scrupulously  honest  person,  and  choose  a  prior 
distribution  that  is  biased  in  favor  of  7r  =  0.5.  The  more  heavily  biased  that  prior  distribution  is,  the  more 
evidence  it  will  take  to  shake  our  pre-existing  belief  that  the  coin  is  fair.4 

Now,  MLE  and  MAP  estimates  are  both  giving  us  the  best  estimate,  according  to  their  respective 
definitions  of  “best.”  But  notice  that  using  a  single  estimate  —  whether  it’s  ttmle  or  it  map  —  throws  away 
information.  In  principle,  7r  could  have  any  value  between  0  and  1;  might  we  not  get  better  estimates  if  we 
took  the  whole  distribution  P(ir\X)  into  account,  rather  than  just  a  single  estimated  value  for  7 r?  If  we  do 
that,  we’re  making  use  of  all  the  information  about  7r  that  we  can  wring  from  the  observed  data,  X . 

The  way  to  take  advantage  of  all  that  information  is  to  calculate  an  expected  value  rather  than  an  estimate 
using  the  single  best  guess  for  7 r.  Recall  that  the  expected  value  of  a  function  f(z ),  when  z  is  a  discrete 
variable,  is 


E[f{z)}  =  £/(z)p(*)-  (6) 

Z(zZ 

Here  Z  is  the  set  of  discrete  values  z  can  take,  and  p(z)  is  the  probability  distribution  over  possible  values 
for  2.  If  z  is  a  continuous  variable,  the  expected  value  is  an  integral  rather  than  a  sum: 


E[f{z)\  =  Jf(z)p(z)dz.  (7) 

For  our  example,  z  =  n,  the  function  f  we’re  interested  in  is  f(z)  =  P{y 1 7r) ,  and  the  distribution  over 
which  we’re  taking  the  expectation  is  P(tt\X),  i.e.  the  whole  distribution  over  possible  values  of  7r  given  that 
we’ve  observed  X.  That  gives  us  the  following  expected  value  for  the  posterior  probability  of  y  given  X: 


P(y\X)  =  j  P(y\ir)P(Tr\X)dn 


(8) 


where  Bayes’s  Rule  defines 


P(tt\X) 


P(X\7r)P(n) 

P{X) 


P(X\n)P(n) 
lnP{X\ir)P{Tr)  dvr' 


(9) 


Notice  that,  unlike  (3)  and  (5),  Equation  (8)  defines  the  posterior  using  a  true  equality,  not  an  approx¬ 
imation.  It  takes  fully  into  account  our  prior  beliefs  about  what  the  value  of  tt  will  be,  along  with  the 
interaction  of  those  prior  beliefs  with  observed  evidence  X. 

Equations  (8)  and  (9)  provide  one  compelling  answer  to  the  question  we  started  with.  Why  should 
even  discrete-minded  computer  scientists  care  about  integrals?  Because  even  when  the  probability  space 
is  discrete,  we  often  care  about  good  estimates  of  posterior  probabilities.  Computing  integrals  can  help  us 
improve  the  parameter  estimates  in  our  models.5 


4See  http://www.math.uah.edu/STAT/objects/experiments/BetaCoinExperiment.xhtml  for  a  nice  applet  that  lets  you  ex¬ 
plore  this  idea.  If  you  set  a  =  b  =  10,  you  get  a  prior  strongly  biased  toward  0.5,  and  it’s  hard  to  move  the  posterior  too 
far  from  that  value  even  if  you  generate  observed  heads  with  probability  p  =  0.8.  If  you  set  a  =  b  =  2,  there’s  still  a  bias 
toward  0.5  but  it’s  much  easier  to  move  the  posterior  off  that  value.  As  a  second  pointer,  see  some  nice,  self-contained  slides  at 
http:  / /www. cs.cmu.edu/~lewicki/cp-s08/Bayesian-inference. pdf. 

5 Chris  Dyer  (personal  communication)  points  out  you  don’t  have  to  be  doing  Bayesian  estimation  to  care  about  expected 
values.  For  example,  better  ways  to  compute  expected  values  can  be  useful  in  the  E  step  of  expectation-maximization  algorithms, 
which  give  you  maximum  likelihood  estimates  for  models  with  latent  variables.  He  also  points  out  that  for  many  models, 
Bayesian  parameter  estimation  can  be  a  whole  lot  easier  to  implement  than  EM.  The  widely  used  GIZA++  implementation  of 
IBM  Model  3  (a  probabilistic  model  used  in  statistical  machine  translation  [12])  contains  2186  lines  of  code;  Chris  implemented 
a  Gibbs  sampler  for  Model  3  in  67  lines.  On  a  related  note,  Kevin  Knight’s  excellent  “Bayesian  Inference  with  Tears:  A  tutorial 
workbook  for  natural  language  researchers”  [9]  was  written  with  goals  very  similar  to  our  own,  but  from  an  almost  completely 
complementary  angle:  he  emphasizes  conceptual  connections  to  EM  algorithms  and  focuses  on  the  kinds  of  structured  problems 
you  tend  to  see  in  natural  language  processing. 
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1.2  Why  sampling? 


The  trouble  with  integrals,  of  course,  is  that  they  can  be  very  difficult  to  calculate.  The  methods  we  learned 
in  calculus  class  are  fine  for  classroom  exercises,  but  often  cannot  be  applied  to  interesting  problems  in  the 
real  world.  Indeed,  analytical  solutions  to  (8)  and  the  denominator  of  (9)  might  be  impossible  to  obtain,  so 
we  might  not  be  able  to  determine  the  exact  form  of  P(jk\X).  Gibbs  sampling  allows  us  to  sample  from  a 
distribution  that  asymptotically  follows  P(tt\X)  without  having  to  explicitly  calculate  the  integrals. 

1.2.1  Monte  Carlo:  a  circle,  a  square,  and  a  bag  of  rice 

Gibbs  Sampling  is  an  instance  of  a  Markov  Chain  Monte  Carlo  technique.  Let’s  start  with  the  “Monte 
Carlo”  part.  You  can  think  of  Monte  Carlo  methods  as  algorithms  that  help  you  obtain  a  desired  value  by 
performing  simulations  involving  probabilistic  choices.  As  a  simple  example,  here’s  a  cute,  low-tech  Monte 
Carlo  technique  for  estimating  the  value  of  n  (the  ratio  of  a  circle’s  circumference  to  its  diameter).6 

Draw  a  perfect  square  on  the  ground.  Inscribe  a  circle  in  it  —  i.e.  the  circle  and  the  square  are  centered 
in  exactly  the  same  place,  and  the  circle’s  diameter  has  length  identical  to  the  side  of  the  square.  Now  take 
a  bag  of  rice,  and  scatter  the  grains  uniformly  at  random  inside  the  square.  Finally,  count  the  total  number 
of  grains  of  rice  inside  the  circle  (call  that  C),  and  inside  the  square  (call  that  S). 

You  scattered  rice  at  random.  Assuming  you  managed  to  do  this  pretty  uniformly,  the  ratio  between  the 
circle’s  grains  and  the  square’s  grains  (which  include  the  circle’s)  should  approximate  the  ratio  between  the 
area  of  the  circle  and  the  area  of  the  square,  so 


C  „  4I)2 

S  ~  d 2 


(10) 


Solving  for  n,  we  get  n  ss  . 

You  may  not  have  realized  it,  but  we  just  solved  a  problem  by  approximating  the  values  of  integrals.  The 
true  area  of  the  circle,  7r(|)2,  is  the  result  of  summing  up  an  infinite  number  of  infinitessimally  small  points; 
similarly  for  the  the  true  area  d2  of  the  square.  The  more  grains  of  rice  we  use,  the  better  our  approximation 
will  be. 


1.2.2  Markov  Chains:  walking  the  right  walk 

In  the  circle- arid-square  example,  we  saw  the  value  of  sampling  involving  a  uniform  distribution,  since  the 
grains  of  rice  were  distributed  uniformly  within  the  square.  Returning  to  the  problem  of  computing  expected 
values,  recall  that  we’re  interested  in  Eprx)[f(x)]  (equation  7),  where  we’ll  assume  that  the  distribution  p(x) 
is  not  uniform  and,  in  fact,  not  easy  to  work  with  analytically. 

Figure  2  provides  an  example  f(z)  and  p(z)  for  illustration.  Conceptually,  the  integral  in  equation  (7) 
sums  up  f(z)p(z)  over  infinitely  many  values  of  z.  But  rather  than  touching  each  point  in  the  sum  exactly 
once,  here’s  another  way  to  think  about  it:  if  you  sample  N  points  z^\  z^2\  . . .  ,z^N^  at  random  from 
the  probability  density  p(z),  then 


1  N 

Ep(z)[f(z)}=  (11) 

00  v  t— i 

That  looks  a  lot  like  a  kind  of  averaged  value  for  /,  which  makes  a  lot  of  sense  since  in  the  discrete  case 
(equation  6)  the  expected  value  is  nothing  but  a  weighted  average,  where  the  weight  for  each  value  of  2  is 
its  probability. 

Notice,  though,  that  the  value  in  the  sum  is  just  f(z^),  not  f(z^)p(z^)  as  in  the  integral  in  equation  (7). 
Where  did  the  p(z)  part  go?  Intuitively,  if  we’re  sampling  according  to  p(z),  and  count (2)  is  the  number  of 

6We’re  elaborating  on  the  introductory  example  at  http://en.wikipedia.org/wiki/Monte_Carlo_method. 
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f(z) 


Figure  2:  Example  of  computing  expectation  Ep^[f(z)].  (This  figure  was  adapted  from  page  7  of  the 
handouts  for  Chris  Bishop’s  presentation  “NATO  ASI:  Learning  Theory  and  Practice”,  Leuven,  July  2002, 
http://research.microsoft.com/en-us/um/people/cmbishop/downloads/bishop-nato-2.pdf  ) 


times  we  observe  z  in  the  sample,  then  ^count(^)  approaches  p(z)  as  N  — »  oo.  So  the  p(z)  is  implicit  in  the 
way  the  samples  are  drawn. ' 

Looking  at  equation  (11),  it’s  clear  that  we  can  get  an  approximate  value  by  sampling  only  a  finite  number 
of  times,  T: 


iW/(A>]^E/(*(t))-  (!2) 

t= l 

Progress!  Now  we  have  a  way  to  approximate  the  integral.  The  remaining  problem  is  this:  how  do  we 
sample  z^°\  z^\  z@\  . . . ,  z ^T')  according  to  p(z)l 

There  are  a  whole  variety  of  ways  to  go  about  this;  e.g.,  see  Bishop  [2]  Chapter  11  for  discussion  of  rejec¬ 
tion  sampling,  adaptive  rejection  sampling,  adaptive  rejection  Metropolis  sampling,  importance  sampling, 
sampling-importance-sampling,...  For  our  purposes,  though,  the  key  idea  is  to  think  of  z’s  as  points  in  a 
state  space,  and  find  ways  to  “walk  around”  the  space  —  going  from  z ^  to  z^  to  and  so  forth  —  so 
that  the  likelihood  of  visiting  any  point  z  is  proportional  to  p(z).  Figure  2  illustrates  the  reasoning  visually: 
walking  around  values  for  Z,  we  want  to  spend  our  time  adding  values  f(z)  to  the  sum  when  p(z)  is  large, 
e.g.  devoting  more  attention  to  the  space  between  z\  and  22,  as  opposed  to  spending  time  in  less  probable 
portions  of  the  space  like  the  part  between  22  and  23. 

A  walk  of  this  kind  can  be  characterized  abstractly  as  follows: 


1:  :=  a  random  initial  point 

2:  for  t  =  1  to  T  do 
3:  2<‘+1)  :=  g(zW) 

4:  end  for 


'  Okay  —  we’re  playing  a  little  fast  and  loose  here  by  talking  about  counts:  with  Z  continuous,  we’re  not  going  to  see  two 
identical  samples  so  it  doesn’t  really  make  sense  to  talk  about  counting  how  many  times  a  value  was  seen.  But  we 

did  say  “intuitively” ,  right? 
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Here  g  is  a  function  that  makes  a  probabilistic  choice  about  what  state  to  go  to  next  according  to  an 
explicit  or  implicit  transition  probability  -Ptrans(2^+1)  l2^°\  ■■■  i  z^)-8 

The  part  about  probabilistic  choices  makes  this  a  Monte  Carlo  technique.  What  will  make  it  a  Markov 
Chain  Monte  Carlo  technique  is  defining  things  so  that  the  next  state  you  visit,  z^t+1\  depends  only  on  the 
current  state  z^\  That  is, 

A rans(*(t+1)k(0),  *(1),  •  •  • ,  *W)  =  Arans(*(‘+1) (13) 

For  you  language  folks,  this  is  precisely  the  same  idea  as  modeling  word  sequences  using  a  bigram  model, 
where  here  we  have  states  z  instead  of  having  words  w. 

We  included  the  subscript  trans  in  our  notation,  so  that  -Ptrans  can  be  rea<i  as  “transition  probability” , 
in  order  to  emphasize  that  these  are  state-to-state  transition  probabilities  in  a  (first-order)  Markov  model. 
The  heart  of  Markov  Chain  Monte  Carlo  methods  is  designing  g  so  that  the  probability  of  visiting  a  state  z 
will  turn  out  to  be  p(z),  as  desired.  This  can  be  accomplished  by  guaranteeing  that  the  chain,  as  defined  by 
the  transition  probabilities  -Ptrans ,  meets  certain  conditions.  Gibbs  sampling  is  one  algorithm  that  meets 
those  conditions.9 

1.2.3  The  Gibbs  sampling  algorithm 

Gibbs  sampling  is  applicable  in  situations  where  Z  has  at  least  two  dimensions,  i.e.  each  point  z  is  really 
z  =  (zi, . . .  ,Zk),  with  k  >  1.  The  basic  idea  in  Gibbs  sampling  is  that,  rather  than  probabilistically  picking 
the  next  state  all  at  once,  you  make  a  separate  probabilistic  choice  for  each  of  the  k  dimensions,  where  each 
choice  depends  on  the  other  k  —  1  dimensions.10  That  is,  the  probabilistic  walk  through  the  space  proceeds 
as  follows: 


1 

2 

3 

4 

5 

6 


*(°>  :=  (z[°\...,z 

for  t  =  1  to  T  do 
for  i  =  1  to  k  do 


(0)\ 

k  / 


r(*+l) 

end  for 
end  for 


P(Zi\ 


„(*+!) 


v(t+i)  At) 


**+!>  •  •  •  > 


Note  that  we  can  obtain  the  distribution  we  are  sampling  from  by  using  the  definition  of  conditional 
probability: 


P(Zi 


v(t+i) 


Jt+l)  (i) 

"i—  1  >*»+!>' 


A*) 


)  = 


P( 


At+A 


(t+i)  A*)  A*) 


*  ^i-i 


,  z„' 


A*) 


p(At+1 )  At+L)  A*>  -Wl 

r  \*1  f*“i  *i-l  ) 

Notice  that  the  only  difference  between  the  numerator  and  the  denominator  is  that  the  numerator  is  the  full 
joint  probability,  including  zf* ,  whereas  zf^  is  missing  in  the  denominator.  This  will  be  important  later. 
One  full  execution  of  the  inner  loop  and  you’ve  just  computed  your  new  point  z^t+A  =  g{z^))  = 


At+A  AA 


A*h 


(14) 


At+A 


A*+A\ 


You  can  think  of  each  dimension  Zi  as  corresponding  to  a  parameter  or  variable  in  your  model.  Using 
equation  (14),  we  sample  the  new  value  for  each  variable  according  to  its  distribution  based  on  the  values 
of  all  the  other  variables.  During  this  process,  new  values  for  the  variables  are  used  as  soon  as  you  obtain 
them.  For  the  case  of  three  variables: 


•  The  new  value  of  z\  is  sampled  conditioning  on  the  old  values  of  Z2  and  z 3. 

8We’re  deliberately  staying  at  a  high  level  here.  In  the  bigger  picture,  g  might  consider  and  reject  one  or  more  states  before 
finally  deciding  to  accept  one  and  return  it  as  the  value  for  zA+A  _  gee  discussions  of  the  Metropolis-Hastings  algorithm,  e.g. 
Bishop  [2],  Section  11.2. 

®We  told  you  we  were  going  to  keep  theory  to  a  minimum,  didn’t  we? 

1 11  There  are,  of  course,  variations  on  this  basic  scheme.  For  example,  “blocked  sampling”  groups  the  variables  into  b  <  k 
blocks  and  the  variables  in  each  block  are  sampled  together  based  on  the  other  6—1  blocks. 
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•  The  new  value  of  z 2  is  sampled  conditioning  on  the  new  value  of  z\  and  the  old  value  of  z 3. 

•  The  new  value  of  Z3  is  sampled  conditioning  on  the  new  values  of  z\  and  z^. 

1.3  The  remainder  of  this  document 

So,  there  you  have  it.  Gibbs  sampling  makes  it  possible  to  obtain  samples  from  probability  distributions 
without  having  to  explicitly  calculate  the  values  for  their  marginalizing  integrals,  e.g.  computing  expected 
values,  by  defining  a  conceptually  straightforward  approximation.  This  approximation  is  based  on  the  idea 
of  a  probabilistic  walk  through  a  state  space  whose  dimensions  correspond  to  the  variables  or  parameters  in 
your  model. 

Trouble  is,  from  what  we  can  tell,  most  descriptions  of  Gibbs  sampling  pretty  much  stop  there  (if  they’ve 

even  gone  into  that  much  detail).  To  someone  relatively  new  to  this  territory,  though,  that’s  not  nearly  far 

enough  to  figure  out  how  to  do  Gibbs  sampling.  How  exactly  do  you  implement  the  “sampling  from  the 
following  distribution”  part  at  the  heart  of  the  algorithm  (equation  14)  for  your  particular  model?  How  do 
you  deal  with  continuous  parameters  in  your  model?  How  do  you  actually  generate  the  expected  values  you’re 
ultimately  interested  in  (e.g.  equation  8),  as  opposed  to  just  doing  the  probabilistic  walk  for  T  iterations? 

Just  as  the  first  part  of  this  document  aimed  at  explaining  why,  the  remainder  aims  to  explain  how. 
In  Section  2,  we  take  a  very  simple  probabilistic  model  —  Naive  Bayes  —  and  describe  in  considerable 
(painful?)  detail  how  to  construct  a  Gibbs  sampler  for  it.  This  includes  two  crucial  things,  namely  how  to 
employ  conjugate  priors  and  how  to  actually  sample  from  conditional  distributions  per  equation  (14). 

In  Section  3  we  discuss  how  to  actually  obtain  values  from  a  Gibbs  sampler,  as  opposed  to  merely  watching 
it  walk  around  the  state  space.  (Which  might  be  entertaining,  but  wasn’t  really  the  point.)  Our  discussion 
includes  convergence  and  burn-in,  auto-correlation  and  lag,  and  other  practical  issues. 

In  Section  4  we  conclude  with  pointers  to  other  things  you  might  find  it  useful  to  read,  as  well  as  an 
invitation  to  tell  us  how  we  could  make  this  document  more  accurate  or  more  useful. 

2  Deriving  a  Gibbs  Sampler  for  a  Nai've  Bayes  Model 

In  this  section  we  consider  Naive  Bayes  models.11  Let’s  assume  that  items  of  interest  are  documents,  that 
the  features  under  consideration  are  the  words  in  the  document,  and  that  the  document-level  class  variable 
we’re  trying  to  predict  is  a  sentiment  label  whose  value  is  either  0  or  1.  For  ease  of  reference,  we  present 
our  notation  in  Figure  3.  Figure  4  describes  the  model  as  a  “plate  diagram”,  to  which  we  will  refer  when 
describing  the  model. 

2.1  Modeling  How  Documents  are  Generated 

We  represent  each  document  as  a  bag  of  words.  Given  an  unlabeled  document  W j,  our  goal  is  to  pick 
the  best  label  Ly-  =  0  or  L j  =  1.  Sometimes  we  will  refer  to  0  and  1  as  classes  instead  of  labels.  In 
further  discussion  it  will  be  convenient  for  us  to  refer  to  the  sets  of  documents  sharing  the  same  label,  so  for 
notational  convenience  we  define  the  sets  Co  =  {WjjLj  =  0}  and  Ci  =  {Wj|Lj  =  1}.  The  usual  treatment 
of  these  models  is  to  equate  “best”  with  “most  probable”,  and  therefore  our  goal  is  to  choose  the  label  L j 
for  W j  that  maximizes  P(Lj|Wj).  Applying  Bayes’s  Rule, 


Lj  =  argmaxP(L|Wi)  =  argmax 

l  l  P(W  j) 

=  argmax  P(Wj|L)P(L), 

L 

where  the  denominator  P(Wj)  is  omitted  because  it  does  not  depend  on  L. 

This  application  of  Bayes’s  rule  (the  “Bayes”  part  of  “Naive  Bayes”)  allows  us  to  think  of  the  model  in 
terms  of  a  “generative  story”  that  accounts  for  how  documents  are  created.  According  to  that  story,  we 

!  1  We  assume  the  reader  is  familiar  with  Naive  Bayes.  For  a  refresher,  see  [14]. 
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number  of  words  in  the  vocabulary. 

number  of  documents  in  the  corpus. 

hyperparameters  of  the  Beta  distribution. 

hyperparameter  vector  for  the  multinomial  prior. 

pseudocount  for  word  i. 

set  of  documents  labeled  x. 

the  set  of  all  documents. 

number  of  documents  labeled  0  (1). 

document  j’s  frequency  distribution. 

frequency  of  word  i  in  document  j. 

vector  of  document  labels. 

label  for  document  j. 

number  of  words  in  document  j. 

probability  of  word  i. 

probability  of  word  i  from  the  distribution  of  class  x. 

number  of  times  word  i  occurs  in  the  set  of  all  documents  labeled  x. 

set  of  all  documents  except  W j 

vector  of  all  document  labels  except  L j 

number  of  documents  labeled  0(1)  except  for  W j 

set  of  hyperparameters  (7^1, 7^0)  le) 


Figure  3:  Notation. 


Figure  4:  Naive  Bayes  plate  diagram 


first  pick  the  class  label  of  the  document,  L ; ;  our  model  will  assume  that’s  done  by  flipping  a  coin  whose 
probability  of  heads  is  some  value  7r  =  P(Lj  =  1).  We  can  express  this  a  little  more  formally  as 

L j  ~  Bernoulli(7r). 

Then,  for  every  one  of  the  Rj  word  positions  in  the  document,  we  pick  a  word  wy  independently  by 
sampling  randomly  according  to  a  probability  distribution  over  words.  Which  probability  distribution  we 
use  is  based  on  the  label  L j  of  the  document,  so  we’ll  write  them  as  do  and  0  \ .  Formally  one  would  describe 
the  creation  of  document  j’s  bag  of  words  as 

Wj  ~  Multinomial(.Rj,  0).  (15) 

The  assumption  that  the  words  are  chosen  independently  is  the  reason  we  call  the  model  “naive” . 

Notice  that  logically  speaking,  it  made  sense  to  describe  the  model  in  terms  of  two  separate  probability 
distributions,  do  and  B\ ,  each  one  being  a  simple  unigram  distribution.  The  notation  in  (15)  doesn’t  explicitly 
show  that  what  happens  in  generating  Wj  depends  on  whether  Lj  was  0  or  1.  Unfortunately,  that  notational 
choice  seems  to  be  standard,  even  though  it’s  less  transparent.12  We  indicate  the  existence  of  two  9s  by 
including  the  2  in  the  lower  rectangle  of  Figure  4,  but  many  plate  diagrams  in  the  literature  would  not. 
Another,  perhaps  clearer  way  to  describe  the  process  would  be 

Wj  ~  Multinomial(i?j,  dLj).  (16) 

And  that’s  it:  our  “generative  story”  for  the  creation  of  a  whole  set  of  labeled  documents  (Wn,Lra), 
according  to  the  Naive  Bayes  model,  is  that  this  simple  document-level  generative  story  gets  repeated  N 
times,  as  indicated  by  the  N  in  the  upper  rectangle  in  Figure  4. 

2.2  Priors 

Well,  ok,  that’s  not  completely  it.  Where  did  tt  come  from?  Our  generative  story  is  going  to  assume  that 
before  this  whole  process  began,  we  also  picked  tt  randomly.  Specifically  we’ll  assume  that  7r  is  sampled  from 
a  Beta  distribution  with  parameters  7^1  and  7^0-  These  are  referred  to  as  hyperparameters  because  they  are 
parameters  of  a  prior,  which  is  itself  used  to  pick  parameters  of  the  model.  In  Figure  4  we  represent  these  two 
hyperparameters  as  a  single  two-dimensional  vector  7^  =  (7^-1, 7^0)-  When  =  7^0  =  1,  Betafan  1,771-0) 
is  just  a  uniform  distribution,  which  means  that  any  value  for  tt  is  equally  likely.  For  this  reason  we  call 
Beta{  1,1)  an  “uninformed  prior”. 

Similarly,  where  do  Qq  and  6 1  come  from?  Just  as  the  Beta  distribution  can  provide  an  uninformed  prior 
for  a  distribution  making  a  two-way  choice,  the  Dirichlet  distribution  can  provide  an  uninformed  prior  for 
U-way  choices,  where  V  is  the  number  of  words  in  the  vocabulary.  Let  7 s  be  a  U-dimensional  vector  where 
the  value  of  every  dimension  equals  1.  If  dQ  is  sampled  from  Dirichlet  (7$),  every  probability  distribution 
over  words  will  be  equally  likely.  Similarly,  we’ll  assume  Q\  is  sampled  from  Dirichlet (7#). 13  Formally 

tt  ~  Beta(77r) 

6  ~  Dirichlet(70) 

Choosing  the  Beta  and  Dirichlet  distributions  as  priors  for  binomial  and  multinomial  distributions,  re¬ 
spectively,  helps  the  math  work  out  cleanly.  We’ll  discuss  this  in  more  detail  in  Section  2.4.2. 

12The  actual  basis  for  removing  the  subscripts  on  the  parameters  9  is  that  we  assume  the  data  from  one  class  is  independent 
of  the  parameter  estimation  of  all  the  other  classes,  so  essentially  when  we  derive  the  probability  expressions  for  one  class  the 
others  look  the  same  [19]. 

1  ’'Note  that  Qq  and  0 \  are  sampled  separately.  There’s  no  assumption  that  they  are  related  to  each  other  at  all.  Also,  it’s 
worth  noting  that  a  Dirichlet  distribution  is  a  Beta  distribution  if  the  dimension  V  =  2.  Dirichlet  generalizes  Beta  in  the  same 
way  that  multinomial  generalizes  binomial.  Now  you  see  why  we  took  the  trouble  to  represent  7^1  and  7^0  as  a  2-dimensional 
vector  ■y7T . 
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2.3  State  space  and  initialization 

Following  Pedersen  [17,  18],  we’re  going  to  describe  the  Gibbs  sampler  in  a  completely  unsupervised  setting 
where  no  labels  at  all  are  provided  as  training  data.  We’ll  then  briefly  explain  how  to  take  advantage  of 
labeled  data. 

State  space.  Recall  that  the  job  of  a  Gibbs  sampler  is  to  walk  through  an  ^-dimensional  state  space  defined 
by  the  random  variables  (Z\ ,  Z2,  ■  •  •  Zk)  in  the  model.  Every  point  in  that  walk  is  a  collection  (z\ ,  z2, • • •  Zk) 
of  values  for  those  variables. 

In  the  Naive  Bayes  model  we’ve  just  described,  here  are  the  variables  that  define  the  state  space. 

•  one  scalar-valued  variable  7 r 

•  two  vector- valued  variables,  9 q  and  9\ 

•  binary  label  variables  L,  one  for  each  of  the  N  documents 

We  also  have  one  vector  variable  W j  for  each  of  the  N  documents,  but  these  are  observed  variables,  i.e. 
their  values  are  already  known  (and  which  is  why  Wjk  is  shaded  in  Figure  4). 

Initialization.  The  initialization  of  our  sampler  is  going  to  be  very  easy.  Pick  a  value  n  by  sampling  from 
the  Beta(7wi,7^o)  distribution.  Then,  for  each  j,  flip  a  coin  with  success  probability  7 r,  and  assign  label  l/;0) 

—  that  is,  the  label  of  document  j  at  the  0t,!  iteration  based  on  the  outcome  of  the  coin  flip.  Similarly, 
you  also  need  to  initialize  90  and  6 1  by  sampling  from  Dirichlet(7e). 

2.4  Deriving  the  Joint  Distribution 

Recall  that  for  each  iteration  t  =  1 . . .  T  of  sampling,  we  update  every  variable  defining  the  state  space  by 
sampling  from  its  conditional  distribution  given  the  other  variables,  as  described  in  equation  (14). 

Here’s  how  we’re  going  to  proceed: 

•  We  will  define  the  joint  distribution  of  all  the  variables,  corresponding  to  the  numerator  in  (14). 

•  We  simplify  our  expression  for  the  joint  distribution. 

•  We  use  our  final  expression  of  the  joint  distribution  to  define  how  to  sample  from  the  conditional 
distribution  in  (14). 

•  We  give  the  final  form  of  the  sampler  as  pseudocode. 

2.4.1  Expressing  and  simplifying  the  joint  distribution 

According  to  our  model,  the  joint  distribution  for  the  entire  document  collection  is  P(C,  L,  7r,  0q,  9\\  7^1, 7^0, 7e)- 
The  semicolon  indicates  that  the  values  to  its  right  are  parameters  for  this  joint  distribution.  Another  way 
to  say  this  is  that  the  variables  to  the  left  of  the  semicolon  are  conditioned  on  the  hyperparameters  given  to 
the  right  of  the  semicolon.  Using  the  model’s  generative  story,  and,  crucially,  the  independence  assumptions 
that  are  a  part  of  that  story,  the  joint  distribution  can  decomposed  into  a  product  of  several  factors:14 

P(n\^1,^0)P(L\n)P(9ohe)P(Oihe)P(<Co\Oo,'L)P(Ci\e1,-L) 

Let’s  look  at  each  of  these  in  turn. 

14Note  that  we  can  also  obtain  the  products  of  the  joint  distribution  directly  from  our  graphical  model,  Figure  4  by  multiplying 
together  each  latent  variable  conditioned  on  its  parents. 
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•  P(7r|77ri,7Tro)-  The  first  factor  is  the  probability  of  choosing  this  particular  value  of  7 r  given  that  7^1 
and  7^0  are  being  used  as  the  hyperparameters  of  the  Beta  distribution.  By  definition  of  the  Beta 
distribution,  that  probability  is: 

17.1,7.0)  =  ^1+^0)7r7,i-1(i-7r)7.°-1  (17) 

T(7wi)r(7T0) 

And  because 

r(7.i  +  7tto) 

T(77ri)T(7^o) 

is  a  constant  that  doesn’t  depend  on  7r,  we  can  rewrite  this  as: 

P(7T|7irl,7iro)  =  C  7T7,rl_1(l  -  7 r)7'0”1.  (18) 

The  constant  c  is  a  normalizing  constant  that  makes  sure  P(n\'yni,  7^0)  sums  to  1  over  all  n.  T(x) 
is  the  gamma  function,  a  continuous-valued  generalization  of  the  factorial  function.  We  could  also 
express  (18)  as 


P(7r|7iri)7iro)  OC  tt7"1  1  (l  -  it)1*0  1.  (19) 

•  P(L|7t).  The  second  factor  is  the  probability  of  obtaining  this  specific  sequence  L  of  N  binary  labels, 
given  that  the  probability  of  choosing  label  =  1  is  it.  That’s 

N 

P(L|tt)  =  ]j7rL"(l-7r)(1-L")  (20) 

71=1 

=  7TCl(l-7r)Co  (21) 

Recall  from  Figure  3  that  Co  and  C\  are  nothing  more  than  the  number  of  documents  labeled  1  and 
0  respectively.15 

•  P(#o|7e)  and  P(6i\je).  The  third  factors  are  the  probability  of  having  sampled  these  particular  choices 
of  word  distributions,  given  that  7e  was  used  as  the  hyperparameter  of  the  Dirichlet  distribution. 

Since  the  6  distributions  are  independent  of  each  other,  let’s  make  the  notation  a  bit  easier  to  read  here 
and  consider  each  of  them  in  isolation,  allowing  us  to  momentarily  elide  the  subscript  saying  which  one 
is  which.  By  the  definition  of  the  Dirichlet  distribution,  the  probability  of  each  word  distribution  is 


P(*  he) 


_  r(Z)Li7fli)  TTg>7(H-i 

nr=ir(7«)iv 

=  c'fler-1 

i—1 

1=1 


(22) 

(23) 

(24) 


Recall  that  7 denotes  the  value  of  vector  7e’s  ith  dimension,  and  similarly,  0.,  is  the  value  for  the 
ith  dimension  of  vector  6.  i.e.  the  probability  assigned  by  this  distribution  to  the  ith  word  in  the 
vocabulary,  d  is  another  normalization  constant  that  we  can  discard  by  changing  the  equality  to  a 
proportionality. 

15  Of  course  we  can  represent  these  quantities  given  the  variables  we  already  have,  but  we  define  these  in  the  interests  of 
simplifying  the  equations  somewhat. 
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•  P(Cq|0q,L)  and  P(Ci|0i,L).  These  are  the  probabilities  of  generating  the  contents  of  the  bags  of 
words  in  each  of  the  two  document  classes. 

Generating  the  bag  of  words  W„  for  document  n  depends  on  that  document’s  label,  Ln  and  the  word 
probability  distribution  associated  with  that  label,  0l„  (so  $l„  is  either  9 o  or  6 1).  For  notational 
simplicity,  let’s  let  6  =  6 

v 

P(W„  |L,  0Ln)  =  \\9^  (25) 

i— 1 

Here  9i  is  the  probability  of  word  i  in  distribution  9,  and  the  exponent  W is  the  frequency  of  word  % 
in  Wn. 

Now,  since  the  documents  are  generated  independently  of  each  other,  we  can  multiply  the  value  in  (25) 
for  each  of  the  documents  in  each  class  to  get  the  combined  probability  of  all  the  observed  bags  of 
words  within  a  class: 


p(cx\l,9X)  =  n  n  o™r 

ndzCx  i= 1 

=  IK'-'" 

i=l 

Where  Afcx  (i)  gives  the  count  of  word  i  in  documents  with  class  label  x. 

2.4.2  Choice  of  Priors  and  Simplifying  the  Joint  Probability  Expression 

So  why  did  we  pick  the  Dirichlet  distribution  as  our  prior  for  9 o  and  9 1  and  the  Beta  distribution  as  our 
prior  for  7r ?  Let’s  look  at  what  happens  in  the  process  of  simplifying  the  joint  distribution  and  see  what 
happens  to  our  estimates  of  9  (where  this  can  be  either  9 q  or  9 1)  and  n  once  we  observe  some  evidence  (i.e. 
the  words  from  a  single  document).  Using  (19)  and  (21)  from  above: 


(26) 

(27) 


P{l T|L;7irl,7ir0) 


=  P(L|7t)P(7t|7w1,7w0) 

oc  \ttCi  (1  —  n)Col\  [7T7,rl_1(l  —  7r)7,rCI_1] 
OC  7rc'1+7’ri-1(l  _  t^Co+TVo-1 


Likewise,  for  9  using  (24)  and  (25)  from  above: 


P{0  |W„;7e) 


OC 


OC 


P(Wn\9)P(9\lg) 

ne-ne-1 

i= 1  i= 1 

V 

ni+lGi  —  1 

i= 1 


If  we  use  the  words  in  all  of  the  documents  of  a  given  class,  then  we  have: 


P(9x\Cx;le) 


n* 


x,i 


(0+ 


(28) 

(29) 

(30) 


(31) 


(32) 
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Notice  that  (30)  is  an  unnormalized  Beta  distribution,  with  parameters  C\  +  7^1  and  Co  +  7-n-o>  and  (32) 
is  an  unnormalized  Dirichlet  distribution,  with  parameter  vector  (A/ca.  ('*)  +  7 ei)  for  1  <  *  <  V.  When  the 
posterior  probability  distribution  is  of  the  same  family  as  the  likelihood  probability  distribution  —  that  is,  the 
same  functional  form,  just  with  different  arguments  —  it  is  said  to  be  the  conjugate  prior  of  the  posterior. 
The  Beta  distribution  is  the  conjugate  prior  for  binomial  (and  Bernoulli)  distributions  and  the  Dirichlet 
distribution  is  the  conjugate  prior  for  multinomial  distributions.  Also  notice  what  role  the  hyperparameters 
played — they  are  added  just  like  observed  evidence.  It  is  for  this  reason  that  the  hyperparameters  are 
sometimes  referred  to  as  pseudocounts. 

Okay,  so  if  we  multiply  together  the  individual  factors  from  Section  2.4.1  as  simplified  above  (in  Section 
2.4.2)  and  let  fi  =  (7^1, 7^0, 7@)  we  can  express  the  full  joint  distribution  as: 


P( C,L,7T,  0o,0i;  ju) 


oc 


7j-C'i+77ri  —  1  ^  Co +771-0  —  I 


V 

n^A/co(«)+70i-l/3A/'c1  (i)+70*-l 
“o,i  “l,i 


i= 1 


(33) 


2.4.3  Integrating  out  n 

Having  illustrated  how  to  derive  the  joint  distribution  for  a  model,  as  in  (33),  it  turns  out  that,  for  this 
particular  model,  we  can  make  our  lives  a  little  bit  simpler:  we  can  reduce  the  effective  number  of  parameters 
in  the  model  by  integrating  our  joint  distribution  with  respect  to  7r.  This  has  the  effect  of  taking  all  possible 
values  of  7 r  into  account  in  our  sampler,  without  representing  it  as  a  variable  explicitly  and  having  to  sample 
it  at  every  iteration.  Intuitively,  “integrating  out”  a  variable  is  an  application  of  precisely  the  same  principle 
as  computing  the  marginal  probability  for  a  discrete  distribution.  For  example,  if  we  have  an  expression  for 
P(a,  b,  c),  we  can  compute  P(a,  b)  by  summing  over  all  possible  values  of  c,  i.e.  P(a,  b)  =  )T)c  P(a,  b,  c).  As 
a  result,  c  is  “there”  conceptually,  in  terms  of  our  understanding  of  the  model,  but  we  don’t  need  to  deal 
with  manipulating  it  explicitly  as  a  parameter.  With  a  continuous  variable,  the  principle  is  the  same,  but 
we  integrate  over  all  possible  values  of  the  variable  rather  than  summing. 

So,  we  have 


P(L,C,  0o,0i;£t)  =  /  P(L,C,  0o,0i,  Tr,fi)  cItt 

J  7 T 


(34) 


=  /  P(^|7-i,7xo)P(Lk)P(0o+)mi7e)-P(Co|0o,L)P(C1|01,L)  cItt  (35) 

J  7T 

=  P(0o|79)mi7e)-P(Co|0o+)P(C1|01,L)  f  P(7r|7Tl,7„o)P(L|7r)  dvr  (36) 

J  7T 

At  this  point  let’s  focus  our  attention  on  the  integrand  only  and  substitute  the  true  distributions  from 
(17)  and  (21). 


-P^+l^+P+K)  d7T 


/  r(i  _  7r)7„o-i7rc1(1  _  7r)Co  d7r 

Jtt  147^1^(7^0) 

r(7.i  +  7+  r  7rc1+7.1-i(1_7r)c0+7.0-i  d7r 

147+147+  j v 


(37) 

(38) 


Here’s  where  our  use  of  conjugate  priors  pays  off.  Notice  that  the  integrand  in  (38)  is  a  Beta  distribution  with 
parameters  C\  +  7^1  and  Co  +  7ti-o-  This  means  that  the  value  of  the  integral  is  just  the  normalizing  constant 
for  that  distribution,  which  is  easy  to  look  up  (e.g.  in  the  entry  for  the  Beta  distribution  in  Wikipedia):  the 
normalizing  constant  for  distribution  Beta(Ci  +  7^1,  Cq  +  7^0)  is 


r(C0  +  Cl  +  7-n-i  +  7x0) 
I\Cl  +  77ri)T(C0  +  7+ 
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Making  that  substitution  in  (38),  and  also  substituting  N  =  Co  +  C±,  we  arrive  at 


d7r 


r(7TTi  +  7^o)  T(N  +  7^1  +  7^o) 

r(7wi)r(7^0)  r(c,i  +  7*-i)r(Cb  +  7^0) 


(39) 


Substituting  (39)  and  the  definitions  of  the  probability  distributions  from  Section  2.4.1  back  into  (36) 
gives  us 


P( L,C,0o,0i;/i)  oc 


r(7-7ri  +  7tto)  T(N  +  7^1  +  7^o) 

r(77ri)r(77ro)  r(Ci  +  y^r^Co  +  7^0) 


n«-w+w-i 


i= 1 


jA/ci  (»)+70*  —  1 

1  ,i 


(40) 


Okay,  so  it  would  be  reasonable  at  this  point  to  ask,  “I  thought  the  point  of  integrating  out  n  was  to 
simplify  things,  so  why  did  we  just  ’simplify’  the  joint  expression  by  adding  in  a  bunch  of  these  Gamma 
functions  everywhere?”  Good  question-hold  that  thought  until  we  derive  the  sampler. 


2.5  Building  the  Gibbs  Sampler 

The  definition  of  a  Gibbs  sampler  specifies  that  in  each  iteration  we  assign  a  new  value  to  variable  Zt  by 
sampling  from  the  conditional  distribution 


So,  for  example,  to  assign  the  value  of  L^i+1\  we  need  to  compute  this  conditional  distribution:16 

P{U  |lW,...,lW,C,0?),  <;/*), 

To  assign  the  value  of  l4*+1\  we  need  to  compute 


P(L2  IL^.Lf,...,^,* 


and  so  forth  for  Lg4-1-1'*  through  L^+1\  To  assign  the  value  of  9q  we  need  to  compute 
Similarly,  to  assign  the  value  of  9 o  we  need  to  sample  from  the  conditional  distribution 


P(G 


(i+l) 

0 


It  (*+1)  t  (*+1) 
Tl  >  lj2  5 


■  (t+1) 

-‘N  ) 


}(*) . 


M), 


and,  for  9 1, 


P{9(i+1) \L[t+1)  ,L(2t+1) , 


■  (t+i) 
JN  ’ 


>  ^0  ’ 


p), 


Intuitively,  at  the  start  of  an  iteration  t,  we  have  a  collection  of  all  our  current  information  at  this 
point  in  the  sampling  process.  That  information  includes  the  word  count  for  each  document,  the  number  of 
documents  labeled  0,  the  number  of  documents  labeled  1,  the  word  counts  for  all  documents  labeled  0,  the 
word  counts  for  all  documents  labeled  1,  the  current  label  for  each  document,  the  current  values  of  9 o  and  9 1, 
etc.  When  we  want  to  sample  the  new  label  for  document  j,  we  temporarily  remove  all  information  (i.e.  word 
counts  and  label  information)  about  this  document  from  that  collection  of  information.  Then  we  look  at  the 
conditional  probability  that  Lj  =  0  given  all  the  remaining  information,  and  the  conditional  probability  that 
Lj  =  1  given  the  same  information,  and  we  sample  the  new  label  L by  choosing  randomly  according  to 

the  relative  weight  of  those  two  conditional  probabilities.  Sampling  to  get  the  new  values  9^+  i  1  and  0^,+1) 
operates  according  to  the  same  principal. 

16There’s  no  superscript  on  the  bags  of  words  C  because  they’re  fully  observed  and  don’t  change  from  iteration  to  iteration. 
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2.5.1  Sampling  for  Document  Labels 

Okay,  so  how  do  we  actually  do  the  sampling?  We’re  almost  there.  As  the  final  step  in  our  journey,  we  show 
how  to  select  all  of  the  new  document  labels  Lj  and  the  new  distributions  9 0  and  61  during  each  iteration 
of  the  sampler.  By  definition  of  conditional  probability, 


^>(Li|L(_-7’),C(~-J’),  00, ;  n) 


P(L„  W^LbAC^flo,^;/!) 

P(  L,C,9o,ei;n) 


(41) 

(42) 


where  L^~^  are  all  the  document  labels  except  Lj,  and  is  the  set  of  all  documents  except  Wr  The 

distribution  is  over  two  possible  outcomes,  Lj  =  0  and  Lj  =  1. 

Notice  that  the  numerator  is  just  the  complete  joint  probability  described  in  (40).  In  the  denominator, 
we  have  the  same  expression,  minus  all  the  information  about  document  W j.  Therefore  we  can  work  out 
what  (42)  should  look  like  by  considering  the  three  factors  in  (40),  one  at  a  time.  For  each  factor,  we  will 
remind  ourselves  of  what  it  looks  like  in  the  numerator  (which  includes  Wj),  and  work  out  what  it  should 
look  like  in  the  denominator  (excluding  Wj),  for  each  of  the  two  outcomes. 

The  first  factor  in  (40), 


r(7-7rl  +  7-n-o)  /,o\ 

r(7.i)r(7.o)’  1  j 

is  very  easy.  It  depends  only  on  the  hyperparameters,  so  removing  information  about  W j  has  no  impact 
on  its  value.  Since  it  will  be  the  same  in  both  the  numerator  and  denominator  of  (42),  it  will  cancel  out. 
Excellent!  Two  factors  to  go. 

Let’s  look  at  the  second  factor  of  (40).  Taking  all  documents  into  account  including  Wj,  this  is 


L(N  +  Tn-1  +  7tto) 
r(C'i  +  7.i)r(C0  +  7tto) 


(44) 


Now,  in  order  to  compute  the  denominator  of  (40),  we  remove  document  Wj  from  consideration.  How  does 
that  change  things?  It  depends  on  what  Lj  was  during  the  previous  iteration.  Whether  Lj  =  0  or  Lj  =  1 
during  the  previous  iteration,  the  corpus  size  is  effectively  reduced  from  N  to  N  —  1,  and  the  size  of  one  of 
the  document  classes  is  smaller  by  one  compared  to  its  value  in  the  numerator.  If  Lj  =  0  when  we  remove 
it,  then  we  will  have  C g  =  Co  —  1  and  C\  =  C\.  If  Lj  =  1  during  the  previous  iteration,  then  we  will 
have  Co  =  Cq~ ^  and  C{  ^  =  C\  —  1.  In  each  case,  removing  Wj  only  changes  the  information  we  know 
about  one  class,  which  means  that  of  the  two  terms  in  the  denominator  of  this  factor,  one  of  them  is  going 
to  be  the  same  in  the  numerator  and  the  denominator  of  (42).  That’s  going  to  cause  the  terms  from  one  of 
the  classes  (the  one  Wj  did  not  belong  to  after  the  previous  iteration)  to  cancel  out. 

Let  x  €  {0, 1}  be  the  outcome  we’re  considering,  i.e.  the  one  for  which  =  Cx  —  l.17  If  we  use  x 

and  reorganize  the  expression  a  bit,  the  second  factor  of  (42)  can  be  rewritten  from 


r(iV+77ri+7Jro) 

r(Ci+7^i)r(c0+7Jo) 

r(jV+7^i+7^o-l) 

r(c-<-J')+7.o)r(c<-J>+7^i) 

to 

L(N  +  7^i  +  77ro)r(CE  + 

7 tvx  1)  (45) 

r  (cx  +  77T*)r(iv  +  7^1  +  7^0  —  i) 

17Crucially,  note  that  x  here  is  not  ,  the  label  of  document  j  at  the  previous  iteration.  We  are  pulling  document  j  out 
of  the  collection,  effectively  obviating  our  knowledge  of  its  current  label,  and  then  constructing  a  distribution  over  the  two 
possibilities,  L j  =  0  and  L j  =  1.  So  we  need  for  our  expression  to  take  a;  as  a  parameter,  allowing  us  to  build  a  probability 
distribution  by  asking  “what  if  x  =  0?”  and  “what  if  x  =  1?”. 
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Using  the  fact  that  T(a  +  1)  =  ar(a)  for  all  a ,  we  can  simplify  further  to  get 


CX  J-  'y-TTX 

N  +  7,rl  +  7^-0  —  1 

Look,  no  more  pesky  Gammas! 

Finally,  the  third  factor  in  (40)  is 

v 

J^gA/co0)+79i-l^c1(i)+7ei-l 

i—1 


(46) 


(47) 


When  we  look  at  what  changes  when  we  remove  W j  from  consideration,  in  order  to  write  the  corresponding 
expression  in  the  denominator  of  (42),  we  see  that  it  will  behave  in  the  same  way  as  the  second  factor.  One 
of  the  classes  remains  unchanged,  so  one  of  the  terms,  either  the  9 q  or  9 i,  will  cancel  out  when  we  do  the 
division.  If,  similar  to  above,  we  let  x  be  the  class  for  which  Cx~^  =  Cx  —  1,  then  we  can  again  capture 
both  cases  in  one  expression: 


n 


nNcx  (*)+7fli-l 

Vx,i _ 

•/V'c(-i)(i)+7ei-l 

•* 


V 


i= 1 


(48) 


With  that,  we  have  finished  working  through  the  three  factors  in  (40),  which  means  we  have  worked  out 
how  to  express  the  numerator  and  denominator  of  (42),  factor  by  factor.  Recombining  the  factors,  we  get 
the  following  expression  for  the  conditional  distribution  in  (42):  for  x  £  {0, 1}, 


Py(Lj  =  x\Li-j\c^-j\eQ:e1-lfJ,) 


Cx  “I-  'Jirx 

N  +  7tti  +  7tt0  —  1 


n« 

i=  1 


nW  , 


(49) 


Let’s  take  a  step  back  and  take  a  look  at  what  this  equation  is  telling  us  about  how  a  label  is  chosen.  Its 
first  factor  gives  us  an  indication  of  how  likely  it  is  that  L j  =  x  considering  only  the  distribution  of  the  other 
labels.  So,  for  example,  if  the  corpus  had  more  class  0  documents  than  class  1  documents,  this  factor  would 
tend  to  push  the  label  toward  class  0.  Its  second  factor  is  like  a  word  distribution  “fitting  room.”  We  get  an 
indication  of  how  well  the  words  in  W j  “fit”  with  each  of  the  two  distributions.  If,  for  example,  the  words 
from  W j  “fit”  better  with  distribution  9 o  (by  having  a  larger  value  for  the  document  probability  using  9 o) 
then  this  factor  will  push  the  label  toward  class  0  as  well. 

So  (finally!)  here’s  the  actual  procedure  to  sample  from  the  conditional  distribution  in  (42): 


1.  Let  valued  =  expression  (49)  with  x  =  0 


2.  Let  valuel  =  expression  (49)  with  x  =  1 


3.  Let  the  distribution  be 


value  0 


value  1 


valueO-\- valuel  ’  valueO-\-v aluel  ' 


4. 


Select  the  value  of  L as  the  result  of  a  Bernoulli  trial  (weighted  coin  flip)  according  to  this  distri¬ 
bution. 


2.5.2  Sampling  for  6 

We’ll  follow  a  similar  procedure  to  determine  how  to  sample  for  new  values  of  9 o  and  Q\.  Since  the  estimation 
of  the  two  distributions  is  independent  of  one  another,  we’re  going  to  omit  the  subscripts  on  6  to  make  the 
notation  a  bit  easier  to  digest.  Just  like  above,  we’ll  need  to  derive  an  expression  for  the  probability  of  9 
given  all  other  variables,  but  our  work  is  a  bit  simpler  in  this  case.  Observe, 


P(0|<C,L;/i)  a  P(C,L\9)P{9\n) 


(50) 
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Furthermore,  recall  that,  since  we  used  conjugate  priors,  this  posterior,  like  the  prior,  works  out  to  be  a 
Dirichlet  distribution.  We  actually  derived  the  full  expression  in  Section  2.4.2,  but  we  don’t  need  the  full 
expression  here.  All  we  need  to  do  to  sample  a  new  distribution  is  to  make  another  draw  from  a  Dirichlet 
distribution,  but  this  time  with  parameters  (i)  +  70;  for  each  i  in  V.  For  notational  convenience,  let’s 
define  the  V  dimensional  vector  t  such  that  each  ti  =  (i)  +7 00  where  x  is  again  either  0  or  1  depending 

on  which  6  we  are  resampling.  We  then  sample  a  new  Q  as: 

6  ~  Dirichlet(t)  (51) 

How  do  you  actually  implement  sampling  from  a  new  Dirichlet  distribution?  To  sample  a  random  vector 
a  =  (ai, . . . ,  ay)  from  the  U-dimensional  Dirichlet  distribution  with  parameters  (an, . . . ,  ay),  one  fast  way 
is  to  draw  V  independent  samples  y  -[f ,  t/y  from  gamma  distributions,  each  with  density 

O' j  —  1  — 

Gamma(a,,  1)  =  *  6  — ,  (52) 

r(a») 


and  then  set  ai  =  yt/  1  Vj  (i-e->  just  normalize  each  of  the  gamma  distribution  draws).18 

2.5.3  Taking  advantage  of  documents  with  labels 

Using  labeled  documents  is  relatively  painless:  just  don’t  sample  L j  for  those  documents!  Always  keep 
Lj  equal  to  the  observed  label.  The  documents  will  effectively  serve  as  “ground  truth”  evidence  for  the 
distributions  that  created  them.  Since  we  never  sample  for  their  labels,  they  will  always  contribute  to  the 
counts  in  (49)  and  (51)  and  will  never  be  subtracted  out. 

2.5.4  Putting  it  all  together 

Initialization.  Define  the  priors  as  in  Section  2.2  and  initialize  them  as  described  in  Section  2.3. 


1:  for  t  :=  1  to  T  do 

2:  for  j  :=  1  to  N  do 

3:  if  j  is  not  a  training  document  then 

4:  Subtract  j’s  word  counts  from  the  total  word  counts  of  whatever  class  it’s  currently  a  member  of 

5:  Subtract  1  from  the  count  of  documents  with  label  Lj 

6:  Assign  a  new  label  L ^i+1'  to  document  j  as  described  at  the  end  of  Section  2.5.1 

7:  Add  1  to  the  count  of  documents  with  label  L^+1') 

8:  Add  j’s  word  counts  to  the  total  word  counts  for  class  Iqt+1) 

9:  end  if 

10:  end  for 

11:  to  :=  vector  of  total  word  counts  from  class  0,  including  pseudocounts 

12:  do  ~  Dirichlet(to),  as  described  in  Section  2.5.2 

13:  ti  :=  vector  of  total  word  counts  from  class  1,  including  pseudocounts 

14:  0 1  ~  Dirichlet  (ti),  as  described  in  Section  2.5.2 

15:  end  for 


Sampling  iterations.  Notice  that  as  soon  as  a  new  label  for  L j  is  assigned,  this  changes  the  counts  that 
will  affect  the  labeling  of  the  subsequent  documents.  This  is,  in  fact,  the  whole  principle  behind  a  Gibbs 
sampler! 

That  concludes  the  discussion  of  how  sampling  is  done.  We’ll  see  how  to  get  from  the  output  of  the 
sampler  to  estimated  values  for  the  variables  in  Section  3. 

18For  details,  see  http://en.wikipedia.org/wiki/Dirichlet_distribution  (version  of  April  12,  2010). 
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2.6  Optional:  A  Note  on  Integrating  out  Continuous  Parameters 

At  this  point  you  might  be  asking  yourself  why  we  were  able  to  integrate  out  the  continuous  parameter  it 
from  our  model,  but  did  not  do  something  similar  with  the  two  word  distributions  6a  and  0\.  The  idea  of 
doing  this  even  looks  promising,  but  there’s  a  subtle  problem  that  will  get  us  into  trouble  and  end  up  leaving 
us  with  an  expression  filled  with  T  functions  that  will  not  cancel  out.  Let  us  go  through  the  derivations  and 
see  where  it  leads  us.  If  you  follow  this  piece  of  the  discussion,  then  you  really  understand  the  details!19 

Our  goal  here  would  be  to  obtain  the  probability  that  a  document  was  generated  from  the  same  dis¬ 
tribution  that  generated  the  words  of  a  particular  class  of  documents,  C^.  We  would  then  use  this  as  a 
replacement  for  the  product  in  (48).  We  start  first  by  calculating  the  probability  of  making  a  single  word 
draw  given  the  other  words  in  the  class,  subtracting  out  the  information  about  W; .  In  the  equations  that 
follow  there’s  an  implicit  (—j)  superscript  on  all  of  the  counts.  If  we  let  wk  denote  the  word  at  some  position 
k  in  W j  then, 


Pr (wk  =  y |C£  9);7 e) 


Pr(wk  =  y\6)Pr(O\Ci-»ng)d0 


I A 


a  P(Si=A"ca,  (*)  +  lOi)  TT  n 

P.r.?/  t/-  I  I  v. 


/a 


do 


i=l 

V 


^IK 


vMcx  (*)+78i  — 1 


do 


i= 1 


r(E,=iM:x(f)  +  I6i)  f 

Uti  P^Ca,  (*)  +  lei)  J  A 

r(ELr^cx(f)  +  I6i) 

nLir(M:.(i)+7«) 

r(EjlAc.(0  +  le%)  r(-/vcx(y)  +iey  +  1)  nHiA»^i/r(-^cx  W  +  lei) 


f  ^{v)+16v  ^w+w-i de 

•J  A  • - I  A  n  i 


nlliP^Ca.W  +lei) 
■McAy)  +  ley 
Er=Acx  (i)  +  le% 


P(Ejli-Mcx(*)  +  lei  +  1) 


(53) 

(54) 

(55) 

(56) 

(57) 

(58) 


The  process  we  use  is  actually  the  same  as  the  process  used  to  integrate  7 r,  just  in  the  multidimensional 
case.  The  set  A  is  the  probability  simplex  of  0,  namely  the  set  of  all  6  such  that  JT  9i  =  1.  We  get  to  (54) 
by  substitution  from  the  formulas  we  derived  in  Section  2.4.2,  then  (55)  by  factoring  out  the  normalization 
constant  for  the  Dirichlet  distribution,  since  it  is  constant  with  respect  to  6.  Note  that  the  integrand  of 
(56)  is  actually  another  Dirichlet  distribution,  so  its  integral  is  its  normalization  constant  (same  reasoning 
as  before).  We  substitute  this  in  to  obtain  (57).  Using  the  property  of  T  that  T(x+  1)  =  xP(x)  for  all  x,  we 
can  again  cancel  all  of  the  T  terms. 

At  this  point,  even  though  we  have  a  simple  and  intuitive  result  for  the  probability  of  drawing  a  single 
word  from  a  Dirichlet,  we  actually  need  the  probability  of  drawing  all  words  in  W,  from  the  Dirichlet 
distribution.  What  we’d  really  like  to  do  is  assume  that  the  words  within  a  particular  document  are  drawn 
from  the  same  distribution,  and  just  calculate  the  probability  of  W j  by  multiplying  (58)  over  all  words  in  the 
vocabulary.  But  we  cannot  do  that,  since,  without  the  values  of  6  being  known,  we  cannot  make  independent 
draws  from  a  Dirichlet  distribution  since  our  draws  have  an  effect  on  what  our  estimate  of  6  is! 

We  can  see  this  two  different  ways.  First,  instead  of  drawing  one  word  in  equation  (53),  do  the  derivation 
by  drawing  two  words  at  a  time.20  You’ll  find  that  once  you  hit  (57),  you’ll  have  an  extra  set  of  Gamma 
functions  that  won’t  cancel  out  nicely.  The  second  way  to  see  it  is  actually  by  looking  at  the  plate  diagram 
for  the  model,  Figure  4.  Each  6  effectively  has  Rj  arrows  coming  out  of  it  to  individual  words,  so  every 

19The  authors  thank  Wu  Ke,  who  really  understood  the  details,  for  pointing  out  our  error  in  an  earlier  version  of  this  document 
and  providing  the  illustrating  example  we  go  through  next. 

20This  is  what  Wu  Ke  did  to  demonstrate  his  point  to  us. 
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word  within  a  document  is  in  the  Markov  blanket  of  the  others;  therefore  we  can’t  assume  that  the  words 
are  independent  without  a  fixed  9. 21 

The  lesson  here  is  to  be  careful  when  you  integrate  out  parameters.  If  you’re  doing  a  single  draw  from  a 
multinomial,  then  integrating  out  a  continuous  parameter  can  make  the  sampler  simpler,  since  you  won’t  have 
to  sample  for  it  at  every  iteration.  If,  on  the  other  hand,  you  do  multiple  draws  from  the  same  multinomial, 
integration  (although  possible)  will  result  in  an  expression  that  involves  Gamma  functions.  Calculating 
Gamma  functions  is  undesirable  since  they  are  computationally  expensive,  so  they  can  slow  down  a  sampler 
significantly. 

3  Producing  values  from  the  output  of  a  Gibbs  sampler 

The  initialization  and  sampling  iterations  in  the  Gibbs  sampling  algorithm  will  produce  values  for  each  of 
the  variables,  for  iterations  t  =  1, 2, . . . ,  T.  In  theory,  the  approximated  value  for  any  variable  Zi  can  simply 
be  obtained  by  calculating: 


1 

T 


T 


t—t 


(59) 


as  discussed  in  equation  (12).  However,  expression  (59)  is  not  always  used  directly.  There  are  several 
additional  details  to  note  that  are  a  part  of  typical  sampling  practice.22 


Convergence  and  burn-in  iterations.  Depending  on  the  values  chosen  during  the  initialization  step, 
it  may  take  some  time  for  the  Gibbs  sampler  to  reach  a  point  where  the  points  (z^\  z%\  ■  ■  ■  z^)  are  all 
coming  from  the  stationary  distribution  of  the  Markov  chain,  which  is  an  assumption  of  the  approximation 
in  (59).  In  order  to  avoid  the  estimates  being  contaminated  by  the  values  at  iterations  before  that  point, 
some  practitioners  generally  discard  the  values  at  iterations  t  <  B,  which  are  referred  to  as  the  “burn-in” 
iterations,  so  that  the  average  in  (59)  is  taken  only  over  iterations  B  +  1  through  T.23 

Autocorrelation  and  lag.  The  approximation  in  (59)  assumes  the  samples  for  Z,  are  independent,  but 
we  know  they’re  not,  because  they  were  produced  by  a  process  that  conditions  on  the  previous  point  in  the 
chain  to  generate  the  next  point.  This  is  referred  to  as  autocorrelation  (sometimes  serial  autocorrelation), 
i.e.  correlation  between  successive  values  in  the  data.24  In  order  to  avoid  autocorrelation  problems  (so  that 
the  chain  “mixes  well”),  many  implementations  of  Gibbs  sampling  average  only  every  Lth  value,  where  L  is 
referred  to  as  the  lag.25  In  this  context,  Jordan  Boyd-Graber  (personal  communication)  also  recommends 
looking  at  Neal’s  [15]  discussion  of  likelihood  as  a  metric  of  convergence. 

21The  Markov  blanket  of  a  node  in  a  graphical  model  consists  of  that  node’s  parents,  its  children,  and  the  coparents  of  its 
children.  [16]. 

22Jason  Eisner  (personal  communication)  argues,  with  some  support  from  the  literature,  that  burn-in,  lag,  and  multiple 
chains  are  in  fact  unnecessary  and  it  is  perfectly  correct  to  do  a  single  long  sampling  run  and  keep  all  samples.  See  [4,  5], 
MacKay  ([13],  end  of  section  29.9,  page  381)  and  Roller  and  Friedman  ([10],  end  of  section  12.3.5.2,  page  522). 

23 As  far  as  we  can  tell,  there  is  no  principled  way  to  choose  the  “right”  value  for  B  in  advance.  There  are  a  variety  of 
ways  to  test  for  convergence,  and  to  measure  autocorrelation;  see,  e.g.,  Brian  J.  Smith,  “boa:  An  R  Package  for  MCMC 
Output  Convergence  Assessment  and  Posterior  Inference”,  Journal  of  Statistical  Software,  November  2007,  Volume  21,  Issue 
11,  http://www.jstatsoft.org/  for  practical  discussion.  However,  from  what  we  can  tell,  many  people  just  choose  a  really  big 
value  for  T,  pick  B  to  be  large  also,  and  assume  that  their  samples  are  coming  from  a  chain  that  has  converged. 

24Lohninger  [11]  observes  that  “most  inferential  tests  and  modeling  techniques  fail  if  data  is  autocorrelated” . 

25  Again,  the  choice  of  L  seems  to  be  more  a  matter  of  art  than  science:  people  seem  to  look  at  plots  of  the  autocorrelation 
for  different  values  of  L  and  use  a  value  for  which  the  autocorrelation  drops  off  quickly.  The  autocorrelation  for  variable  Zi 
with  lag  L  is  simply  the  correlation  between  the  sequence  and  the  sequence  Z^  L\  Which  correlation  function  is  used 
seems  to  vary. 
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Multiple  chains.  As  is  the  case  for  many  other  stochastic  algorithms  (e.g.  expectation  maximization  as 
used  in  the  forward-backward  algorithm  for  HMMs),  people  often  try  to  avoid  sensitivity  to  the  starting 
point  chosen  at  initialization  time  by  doing  multiple  runs  from  different  starting  points.  For  Gibbs  sampling 
and  other  Markov  Chain  Monte  Carlo  methods,  these  are  referred  to  as  “multiple  chains”  ,26 

Hyperparameter  sampling.  Rather  than  simply  picking  hyperparameters,  it  is  possible,  and  in  fact 
often  critical,  to  assign  their  values  via  sampling  (Boyd-Graber,  personal  communication).  See,  e.g.,  Wallach 
et  al.  [20]  and  Escobar  and  West  [3]. 

4  Conclusions 

The  main  point  of  this  document  has  been  to  take  some  of  the  mystery  out  of  Gibbs  sampling  for  computer 
scientists  who  want  to  get  their  hands  dirty  and  try  it  out.  Like  any  other  technique,  however,  caveat  lector: 
using  a  tool  with  only  limited  understanding  of  its  theoretical  foundations  can  produce  undetected  mistakes, 
misleading  results,  or  frustration. 

As  a  first  step  toward  getting  further  up  to  speed  on  the  relevant  background,  Ted  Pedersen’s  [18] 
doctoral  dissertation  has  a  very  nice  discussion  of  parameter  estimation  in  Chapter  4,  including  a  detailed 
exposition  of  an  EM  algorithm  for  Naive  Bayes  and  his  own  derivation  of  a  Naive  Bayes  Gibbs  sampler  that 
highlights  the  relationship  to  EM.  (He  works  through  several  iterations  of  each  algorithm  explicitly,  which 
in  our  opinion  merits  a  standing  ovation.)  The  ideas  introduced  in  Chapter  4  are  applied  in  Pedersen  and 
Bruce  [17];  note  that  the  brief  description  of  the  Gibbs  sampler  there  makes  an  awful  lot  more  sense  after 
you’ve  read  Pedersen’s  dissertation  chapter. 

We  also  recommend  Gregor  Heinrich’s  [7]  “Parameter  estimation  for  text  analysis.”  Heinrich  presents 
fundamentals  of  Bayesian  inference  starting  with  a  nice  discussion  of  basics  like  maximum  likelihood  es¬ 
timation  (MLE)  and  maximum  a  posteriori  (MAP)  estimation,  all  with  an  eye  toward  dealing  with  text. 
(We  followed  his  discussion  closely  above  in  Section  1.1.)  Also,  his  is  one  of  the  few  papers  we’ve  been 
able  to  find  that  actually  provides  pseudo-code  for  a  Gibbs  sampler.  Heinrich  discusses  in  detail  Gibbs 
sampling  for  the  widely  discussed  Latent  Dirichlet  Allocation  (LDA)  model,  and  his  corresponding  code  is 
at  http://www.arbylon.net/projects/LdaGibbsSampler.java. 

For  a  textbook-style  exposition,  see  Bishop  [2].  The  relevant  pieces  of  the  book  are  a  little  less  stand¬ 
alone  than  we’d  like  (which  makes  sense  for  a  course  on  machine  learning,  as  opposed  to  just  trying  to  dive 
straight  into  a  specific  topic);  Chapter  11  (Sampling  Methods)  is  most  relevant,  though  you  may  also  find 
yourself  referring  back  to  Chapter  8  (Graphical  Models) . 

Those  ready  to  dive  into  the  topic  of  Markov  Chain  Monte  Carlo  in  more  depth  might  want  to  start 
with  Andrieu  et  al.  [1].  We  and  Andrieu  et  al.  appear  to  differ  somewhat  on  the  semantics  of  the  word 
“introduction,”  which  is  one  of  the  reasons  the  document  you’re  reading  exists. 

Finally,  for  people  interested  in  the  use  of  Gibbs  sampling  for  structured  models  in  NLP  (e.g.  parsing), 
the  right  place  to  start  is  undoubtedly  Kevin  Knight’s  excellent  “Bayesian  Inference  with  Tears:  A  tutorial 
workbook  for  natural  language  researchers”  [9],  after  which  you’ll  be  equipped  to  look  at  Johnson,  Griffiths, 
and  Goldwater  [8].2'  The  leap  from  the  models  discussed  here  to  those  kinds  of  models  actually  turns  out 
to  be  a  lot  less  scary  than  it  appears  at  first.  The  main  thing  to  observe  is  that  in  the  crucial  sampling 
step  (equation  (14)  of  Section  1.2.3),  the  denominator  is  just  the  numerator  without  zf  \  the  variable  whose 
new  value  you’re  choosing.  So  when  you’re  sampling  conditional  distributions  (e.g.  Sections  2.5. 1-2. 5. 4) 
in  more  complex  models,  the  basic  idea  will  be  the  same:  you  subtract  out  counts  related  to  the  variable 
you’re  interested  in  based  on  its  current  value,  compute  proportions  based  on  the  remaining  counts,  then 

26Again,  there  seems  to  be  as  much  art  as  science  in  whether  to  use  multiple  chains,  how  many,  and  how  to  combine  them  to 
get  a  single  output.  Chris  Dyer  (personal  communication)  reports  that  it  is  not  uncommon  simply  to  concatenate  the  chains 
together  after  removing  the  burn-in  iterations. 

27As  an  aside,  our  travels  through  the  literature  in  writing  this  document  led  to  an  interesting  early  use  of  Gibbs  sampling 
with  CFGs:  Grate  et  al.  [6].  Johnson  et  al.  had  not  come  across  this  when  they  wrote  their  seminal  paper  introducing  Gibbs 
sampling  for  PCFGs  to  the  NLP  community  (and  in  fact  a  search  on  scholar.google.com  turned  up  no  citations  in  the  NLP 
literature).  Mark  Johnson  (personal  communication)  observes  that  the  “local  move”  Gibbs  sampler  Grate  et  al.  describe  is 
specialized  to  a  particular  PCFG,  and  it’s  not  clear  how  to  generalize  it  to  arbitrary  PCFGs. 
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pick  probabilistically  based  on  the  result,  and  finally  add  counts  back  in  according  to  the  probabilistic  choice 
you  just  made. 
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